Software Engineering

Showing new listings for Tuesday, 3 June 2025

Total of 59 entries

New submissions (showing 23 of 23 entries)

[1] arXiv:2506.00128 [pdf, html, other]
Title: Applying Large Language Models to Issue Classification: Revisiting with Extended Data and New Models
Gabriel Aracena, Kyle Luster, Fabio Santos, Igor Steinmacher, Marco A. Gerosa
Comments: 35 pages, 2 figures, 9 tables, Pre-print for Science of Computer Programming
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)

Effective prioritization of issue reports in software engineering helps to optimize resource allocation and information recovery. However, manual issue classification is laborious and lacks scalability. As an alternative, many open source software (OSS) projects employ automated processes for this task, yet this method often relies on large datasets for adequate training. Traditionally, machine learning techniques have been used for issue classification. More recently, large language models (LLMs) have emerged as powerful tools for addressing a range of software engineering challenges, including code and test generation, mapping new requirements to legacy software endpoints, and conducting code reviews. This research investigates an automated approach to issue classification based on LLMs. By leveraging the capabilities of such models, we aim to develop a robust system for prioritizing issue reports, reducing the need for extensive training data while maintaining reliability in classification. We developed an LLM-based approach for accurately labeling issues by selecting two of the most prominent large language models and comparing their performance across multiple datasets. Our findings show that GPT-4o achieved the best results in classifying issues from the NLBSE 2024 competition. Moreover, GPT-4o outperformed DeepSeek R1, achieving an F1 score 20% higher when both models were trained on the same dataset from the NLBSE 2023 competition, which was ten times larger than the NLBSE 2024 dataset. The fine-tuned GPT-4o model attained an average F1 score of 80.7%, while the fine-tuned DeepSeek R1 model achieved 59.33%. Increasing the dataset size did not improve the F1 score, suggesting that massive datasets are not required to build an effective solution to issue classification.
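As a rough illustration of this kind of pipeline, the sketch below labels a GitHub issue with a chat-based LLM. The label set, prompt wording, and use of the OpenAI v1 Python client (with an API key in the environment) are assumptions for illustration, not the authors' exact setup.

    from openai import OpenAI

    LABELS = ["bug", "feature", "question"]  # NLBSE-style label set (assumed)
    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def classify_issue(title: str, body: str) -> str:
        prompt = (
            f"Classify the following GitHub issue into exactly one of {LABELS}. "
            "Reply with the label only.\n\n"
            f"Title: {title}\n\nBody: {body}"
        )
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        label = resp.choices[0].message.content.strip().lower()
        return label if label in LABELS else "unknown"

    print(classify_issue("App crashes on startup", "Segfault after updating to v2.1"))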

[2] arXiv:2506.00142 [pdf, html, other]
Title: Understanding Underrepresented Groups in Open Source Software
Reydne Santos, Rafa Prado, Ana Paula de Holanda Silva, Kiev Gama, Fernando Castor, Ronnie de Souza Santos
Subjects: Software Engineering (cs.SE)

Context: Diversity can impact team communication, productivity, cohesiveness, and creativity. Analyzing the existing knowledge about diversity in open source software (OSS) projects can provide directions for future research and raise awareness about barriers and biases against underrepresented groups in OSS. Objective: This study aims to analyze the knowledge about minority groups in OSS projects. We investigated which groups were studied in the OSS literature, the study methods used, their implications, and their recommendations to promote the inclusion of minority groups in OSS projects. Method: To achieve this goal, we performed a systematic literature review that analyzed 42 papers directly studying underrepresented groups in OSS projects. Results: Most papers focus on gender (62.3%), while other dimensions, such as age or ethnicity, are rarely studied. The neurodiversity dimension has not been studied in the context of OSS at all. Our results also reveal that diversity in OSS projects faces several barriers but brings significant benefits, such as promoting safe and welcoming environments. Conclusion: Most analyzed papers adopt a myopic perspective that sees gender as strictly binary. Dimensions of diversity that affect how individuals interact and function in an OSS project, such as age, tenure, and ethnicity, have received very little attention.

[3] arXiv:2506.00150 [pdf, html, other]
Title: Supporting architecture evaluation for ATAM scenarios with LLMs
Rafael Capilla, J. Andrés Díaz-Pace, Yamid Ramírez, Jennifer Pérez, Vanessa Rodríguez-Horcajo
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Architecture evaluation methods have long been used to evaluate software designs. Several evaluation methods have been proposed and used to analyze tradeoffs between different quality attributes. Competing qualities lead to conflicts when selecting the quality-attribute scenarios that an architecture should tackle and when prioritizing the scenarios required by the stakeholders. In this context, architecture evaluation is carried out manually, often involving long brainstorming sessions to decide which quality scenarios are the most adequate. To reduce this effort and make the assessment and selection of scenarios more efficient, we propose using LLMs to partially automate evaluation activities. As a first step to validate this hypothesis, this work studies MS Copilot as an LLM tool to analyze quality scenarios suggested by students in a software architecture course and compares the students' results with the assessment provided by the LLM. Our initial study reveals that, in most cases, the LLM produces better and more accurate results regarding the risks, sensitivity points, and tradeoff analysis of the quality scenarios. Overall, the use of generative AI has the potential to partially automate and support architecture evaluation tasks, improving the human decision-making process.

[4] arXiv:2506.00296 [pdf, html, other]
Title: CRScore++: Reinforcement Learning with Verifiable Tool and AI Feedback for Code Review
Manav Nitin Kapadnis, Atharva Naik, Carolyn Rose
Subjects: Software Engineering (cs.SE)

Improving code review comment generation with reinforcement learning (RL) requires handling unstructured outputs, which makes designing RL feedback challenging. The two main RL approaches, RL with Verifiable Feedback (RLVR) and RL with AI Feedback (RLAIF), offer a trade-off: RLVR provides reliable feedback for structured tasks like code generation, while RLAIF works for unstructured outputs but is subjective. We bridge this gap with CRScore++, an RL framework that leverages both LLM-based subjective feedback and verifiable signals for training. Extending CRScore, a code review evaluation metric that integrates LLMs with verifiers like linters and code smell detectors, CRScore++ transforms these signals into training rewards. We show that CRScore++ improves a weaker student model through a combination of supervised fine-tuning and RL critique from a stronger teacher model, enabling generalization to novel programming languages.
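A minimal sketch of the idea of blending verifiable and subjective rewards, assuming flake8 is installed as the verifier and stubbing the LLM judge; the weighting and both scoring functions are illustrative, not the paper's implementation.

    import subprocess
    import tempfile

    def verifiable_score(code: str) -> float:
        """Fewer linter warnings -> higher reward in (0, 1]."""
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        out = subprocess.run(["flake8", path], capture_output=True, text=True)
        n_warnings = len(out.stdout.splitlines())  # flake8 prints one issue per line
        return 1.0 / (1.0 + n_warnings)

    def llm_score(comment: str, code: str) -> float:
        """Stub for an LLM judge rating review-comment quality in [0, 1]."""
        return 0.5  # replace with a real model call

    def reward(comment: str, code: str, alpha: float = 0.5) -> float:
        # Weighted blend of verifiable and subjective feedback (weights assumed).
        return alpha * verifiable_score(code) + (1 - alpha) * llm_score(comment, code)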

[5] arXiv:2506.00520 [pdf, html, other]
Title: Temac: Multi-Agent Collaboration for Automated Web GUI Testing
Chenxu Liu, Zhiyu Gu, Guoquan Wu, Ying Zhang, Jun Wei, Tao Xie
Subjects: Software Engineering (cs.SE)

Quality assurance of web applications is critical, as web applications play an essential role in people's daily lives. To reduce labor costs, automated web GUI testing (AWGT) is widely adopted, exploring web applications via GUI actions such as clicks and text inputs. However, these approaches face limitations in generating continuous and meaningful action sequences capable of covering complex functionalities. Recent work incorporates large language models (LLMs) for GUI testing. However, these approaches face various challenges, including low efficiency of LLMs, high complexity of rich web application contexts, and a low success rate of LLMs in executing GUI tasks.
To address these challenges, in this paper, we propose Temac, an approach that enhances AWGT using LLM-based multi-agent collaboration to increase code coverage. Temac is motivated by our insight that LLMs can enhance AWGT in executing complex functionalities, while the information discovered during AWGT can, in turn, be provided as the domain knowledge to improve the LLM-based task execution. Specifically, given a web application, Temac initially runs an existing approach to broadly explore application states. When the testing coverage stagnates, Temac then employs LLM-based agents to summarize the collected information to form a knowledge base and to infer not-covered functionalities. Guided by this knowledge base, Temac finally uses specialized LLM-based agents to target and execute the not-covered functionalities, reaching deeper states beyond those explored by the existing approach.
Our evaluation shows that Temac exceeds state-of-the-art approaches by 12.5% to 60.3% in average code coverage on six complex open-source web applications, while revealing 445 unique failures in the top 20 real-world web applications. These results demonstrate the effectiveness and general applicability of Temac.

[6] arXiv:2506.00682 [pdf, other]
Title: Encouraging Students' Responsible Use of GenAI in Software Engineering Education: A Causal Model and Two Institutional Applications
Vahid Garousi, Zafar Jafarov, Aytan Movsumova, Atif Namazov, Huseyn Mirzayev
Subjects: Software Engineering (cs.SE)

Context: As generative AI (GenAI) tools such as ChatGPT and GitHub Copilot become pervasive in education, concerns are rising about students using them to complete rather than learn from coursework, risking overreliance, reduced critical thinking, and long-term skill deficits.
Objective: This paper proposes and empirically applies a causal model to help educators scaffold responsible GenAI use in Software Engineering (SE) education. The model identifies how professor actions, student factors, and GenAI tool characteristics influence students' usage of GenAI tools.
Method: Using a design-based research approach, we applied the model in two contexts: (1) revising four extensive lab assignments of a final-year Software Testing course at Queen's University Belfast (QUB), and (2) embedding GenAI-related competencies into the curriculum of a newly developed SE BSc program at Azerbaijan Technical University (AzTU). Interventions included GenAI usage declarations, output validation tasks, peer-review of AI artifacts, and career-relevant messaging.
Results: In the course-level case, instructor observations and student artifacts indicated increased critical engagement with GenAI, reduced passive reliance, and improved awareness of validation practices. In the curriculum-level case, the model guided integration of GenAI learning outcomes across multiple modules and levels, enabling longitudinal scaffolding of AI literacy.
Conclusion: The causal model served as both a design scaffold and a reflection tool. It helped align GenAI-related pedagogy with SE education goals and can offer a useful framework for instructors and curriculum designers navigating the challenges of GenAI-era education.

[7] arXiv:2506.00693 [pdf, html, other]
Title: Human Attention During Localization of Memory Bugs in C Programs
Emory Smith, Robert Wallace, Matthew Robison, Yu Huang, Collin McMillan
Comments: 12 pages, 12 figures
Subjects: Software Engineering (cs.SE)

This paper presents a study of human visual attention during localization of memory bugs in C. Human visual attention refers to the mechanical processes by which we selectively process and prioritize information. Visual attention is important to study because it is central to what information people (who are sighted) use to solve a particular problem. Meanwhile, memory bugs are among the most common types of bugs in C programs that manifest as a variety of program faults. In this paper, we study human visual attention while people attempt to locate memory bugs in code. We recruit 11 programmers to locate between one and six memory bugs in three C programs for 1.5-2 hours each. In total we collected observations of 17 hours of programmer effort. The bugs in our study cover memory leaks, overflows, and double frees, which are among the most common memory bugs. We analyze the task outcomes in terms of success rate and related factors, patterns of visual attention overall such as what lines and functions are read, and finally we explore differences of visual attention patterns during success versus failure cases.

[8] arXiv:2506.00714 [pdf, html, other]
Title: An LLM Agent for Functional Bug Detection in Network Protocols
Mingwei Zheng, Chengpeng Wang, Xuwei Liu, Jinyao Guo, Shiwei Feng, Xiangyu Zhang
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Functional correctness is critical for ensuring the reliability and security of network protocol implementations. Functional bugs, instances where implementations diverge from behaviors specified in RFC documents, can lead to severe consequences, including faulty routing, authentication bypasses, and service disruptions. Detecting these bugs requires deep semantic analysis across specification documents and source code, a task beyond the capabilities of traditional static analysis tools. This paper introduces RFCScan, an autonomous agent that leverages large language models (LLMs) to detect functional bugs by checking conformance between network protocol implementations and their RFC specifications. Inspired by the human auditing procedure, RFCScan comprises two key components: an indexing agent and a detection agent. The former hierarchically summarizes protocol code semantics, generating semantic indexes that enable the detection agent to narrow down the scanning scope. The latter employs demand-driven retrieval to iteratively collect additional relevant data structures and functions, eventually identifying potential inconsistencies with the RFC specifications effectively. We evaluate RFCScan across six real-world network protocol implementations. RFCScan identifies 47 functional bugs with 81.9% precision, of which 20 bugs have been confirmed or fixed by developers.
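The two-agent structure can be pictured with the toy loop below: a semantic index narrows the initial scope, and the detection agent requests more code on demand. All names, data structures, and the stubbed LLM call are hypothetical stand-ins for the system described above.

    def ask_llm(rfc_section: str, context: str) -> dict:
        """Stub: a real system would prompt an LLM with the RFC text and code."""
        return {"needs": [], "bugs": []}

    def detect_inconsistencies(rfc_section: str, semantic_index: dict,
                               code_base: dict, max_rounds: int = 5) -> list:
        # 1. Coarse scoping via the semantic index built by the indexing agent.
        scope = {name for name, summary in semantic_index.items()
                 if any(tok in summary for tok in rfc_section.lower().split())}
        findings = []
        for _ in range(max_rounds):
            context = "\n".join(code_base[name] for name in scope)
            verdict = ask_llm(rfc_section, context)
            if verdict["needs"]:  # 2. Demand-driven retrieval of more functions
                scope.update(n for n in verdict["needs"] if n in code_base)
                continue
            findings.extend(verdict["bugs"])
            break
        return findings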

[9] arXiv:2506.00750 [pdf, html, other]
Title: CodeSense: a Real-World Benchmark and Dataset for Code Semantic Reasoning
Monoshi Kumar Roy, Simin Chen, Benjamin Steenhoek, Jinjun Peng, Gail Kaiser, Baishakhi Ray, Wei Le
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Understanding and reasoning about code semantics is essential for enhancing code LLMs' abilities to solve real-world software engineering (SE) tasks. Although several code reasoning benchmarks exist, most rely on synthetic datasets or educational coding problems and focus on coarse-grained reasoning tasks such as input/output prediction, limiting their effectiveness in evaluating LLMs in practical SE contexts. To bridge this gap, we propose CodeSense, the first benchmark that makes available a spectrum of fine-grained code reasoning tasks concerned with the software engineering of real-world code. We collected Python, C, and Java software projects from real-world repositories, executed tests from these repositories, collected their execution traces, and constructed a ground truth dataset for fine-grained semantic reasoning tasks. We then performed comprehensive evaluations on state-of-the-art LLMs. Our results show a clear performance gap for the models to handle fine-grained reasoning tasks. Although prompting techniques such as chain-of-thought and in-context learning helped, the lack of code semantics in LLMs fundamentally limits models' capabilities of code reasoning. Besides the dataset, benchmark, and evaluation, our work produced an execution tracing framework and tool set that make it easy to collect ground truth for fine-grained SE reasoning tasks, offering a strong basis for future benchmark construction and model post-training. Our code and data are located at this https URL.
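In the same spirit as the tracing framework mentioned above, the following minimal sketch collects a per-line execution trace (function, line number, local variables) with Python's built-in sys.settrace; the paper's tool set is certainly more elaborate.

    import sys

    trace_log = []

    def tracer(frame, event, arg):
        if event == "line":
            trace_log.append((frame.f_code.co_name, frame.f_lineno,
                              dict(frame.f_locals)))
        return tracer

    def gcd(a, b):
        while b:
            a, b = b, a % b
        return a

    sys.settrace(tracer)
    gcd(48, 18)
    sys.settrace(None)

    for func, line, local_vars in trace_log:
        print(f"{func}:{line} {local_vars}")  # ground truth for value prediction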

[10] arXiv:2506.00788 [pdf, html, other]
Title: Behavioral Augmentation of UML Class Diagrams: An Empirical Study of Large Language Models for Method Generation
Djaber Rouabhia, Ismail Hadjadj
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Automating the enrichment of UML class diagrams with behavioral methods from natural language use cases is a significant challenge. This study evaluates nine large language models (LLMs) in augmenting a methodless UML diagram (21 classes, 17 relationships) using 21 structured waste-management use cases. A total of 90 diagrams (3,373 methods) were assessed across six metrics: method quantity, signature richness (visibility, names, parameters, return types), annotation completeness (linking to use cases/actions), structural fidelity, syntactic correctness (PlantUML compilation), and naming convergence (across models). All LLMs produced valid PlantUML diagrams adhering to UML conventions. Some models excelled in method coverage and annotation accuracy, while others showed richer parameterization but weaker traceability. These results demonstrate that LLMs can generate well-structured methods with consistent naming, advancing automated behavioral modeling. However, inconsistencies in annotations and signatures highlight the need for improved prompt engineering and model selection. The rapid generation of these methods supports Agile practices by enabling faster design iterations. Despite their capabilities, human oversight is essential to ensure accuracy, appropriateness, and semantic alignment. This positions LLMs as collaborative partners in software design. All experimental artifacts (.puml, .png, .csv) are publicly available for reproducibility.

[11] arXiv:2506.00888 [pdf, html, other]
Title: An Integrated Platform for LEED Certification Automation Using Computer Vision and LLM-RAG
Jooyeol Lee
Subjects: Software Engineering (cs.SE)

The Leadership in Energy and Environmental Design (LEED) certification process is characterized by labor-intensive requirements for data handling, simulation, and documentation. This paper presents an automated platform designed to streamline key aspects of LEED certification. The platform integrates a PySide6-based user interface, a Review Manager for process orchestration, and multiple analysis engines for credit compliance, energy modeling via EnergyPlus, and location-based evaluation. Key components include an OpenCV-based preprocessing pipeline for document analysis and a report generation module powered by the Gemma3 large language model with a retrieval-augmented generation framework. Implementation techniques - including computer vision for document analysis, structured LLM prompt design, and RAG-based report generation - are detailed. Initial results from pilot project deployment show improvements in efficiency and accuracy compared to traditional manual workflows, achieving 82% automation coverage and up to 70% reduction in documentation time. The platform demonstrates practical scalability for green building certification automation.

[12] arXiv:2506.00894 [pdf, html, other]
Title: CODEMENV: Benchmarking Large Language Models on Code Migration
Keyuan Cheng, Xudong Shen, Yihao Yang, Tengyue Wang, Yang Cao, Muhammad Asif Ali, Hanbin Wang, Lijie Hu, Di Wang
Comments: Accepted by ACL 2025 Findings
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Large language models (LLMs) have shown remarkable capabilities across various software engineering tasks; however, their effectiveness in code migration, adapting code to run in different environments, remains insufficiently studied. In this work, we introduce CODEMENV: Code Migration Across Environment, a new benchmark specifically designed to assess LLMs' abilities in code migration scenarios. CODEMENV consists of 922 examples spanning 19 Python and Java packages, and covers three core tasks: (1) identifying functions incompatible with specific versions, (2) detecting changes in function definitions, and (3) adapting code to target environments. Experimental evaluation with seven LLMs on CODEMENV yields an average pass@1 rate of 26.50%, with GPT-4o achieving the highest score at 43.84%. Key findings include: (i) LLMs tend to be more proficient with newer function versions, which aids in migrating legacy code, and (ii) LLMs sometimes exhibit logical inconsistencies by identifying function changes irrelevant to the intended migration environment. The datasets are available at this https URL.
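Task (1), flagging functions incompatible with a target environment, can be pictured with the sketch below; the version table and detection logic are illustrative assumptions, not the benchmark's.

    import ast

    # Hypothetical table: qualified name -> Python version that introduced it.
    INTRODUCED_IN = {
        "math.dist": (3, 8),
        "functools.cache": (3, 9),
    }

    def incompatible_calls(source: str, target: tuple) -> list:
        hits = []
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name):
                name = f"{node.value.id}.{node.attr}"
                needed = INTRODUCED_IN.get(name)
                if needed and needed > target:
                    hits.append(f"line {node.lineno}: {name} requires Python "
                                + ".".join(map(str, needed)))
        return hits

    code = "import math\nprint(math.dist((0, 0), (3, 4)))\n"
    print(incompatible_calls(code, target=(3, 7)))  # math.dist needs 3.8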

[13] arXiv:2506.00943 [pdf, html, other]
Title: Legal Compliance Evaluation of Smart Contracts Generated By Large Language Models
Chanuka Wijayakoon, Hai Dong, H.M.N. Dilum Bandara, Zahir Tari, Anurag Soin
Comments: Accepted for publication at IEEE International Conference on Blockchain and Cryptocurrency (ICBC) 2025
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Smart contracts can implement and automate parts of legal contracts, but ensuring their legal compliance remains challenging. Existing approaches such as formal specification, verification, and model-based development require expertise in both the legal and software development domains, as well as extensive manual effort. Given the recent advances of Large Language Models (LLMs) in code generation, we investigate their ability to generate legally compliant smart contracts directly from natural language legal contracts, addressing these challenges. We propose a novel suite of metrics to quantify legal compliance based on modeling both legal and smart contracts as processes and comparing their behaviors. We select four LLMs, generate 20 smart contracts based on five legal contracts, and analyze their legal compliance. We find that while all LLMs generate syntactically correct code, there is significant variance in their legal compliance, with larger models generally showing higher levels of compliance. We also evaluate the proposed metrics against properties of software metrics, showing they provide fine-grained distinctions, enable nuanced comparisons, and are applicable across domains for code from any source, LLM or developer. Our results suggest that LLMs can assist in generating starter code for legally compliant smart contracts with strict reviews, and the proposed metrics provide a foundation for automated and self-improving development workflows.

[14] arXiv:2506.01249 [pdf, html, other]
Title: SysLLMatic: Large Language Models are Software System Optimizers
Huiyun Peng, Arjun Gupte, Ryan Hasler, Nicholas John Eliopoulos, Chien-Chou Ho, Rishi Mantri, Leo Deng, Konstantin Läufer, George K. Thiruvathukal, James C. Davis
Subjects: Software Engineering (cs.SE); Performance (cs.PF)

Automatic software system optimization can improve software speed, reduce operating costs, and save energy. Traditional approaches to optimization rely on manual tuning and compiler heuristics, limiting their ability to generalize across diverse codebases and system contexts. Recent methods using Large Language Models (LLMs) offer automation to address these limitations, but often fail to scale to the complexity of real-world software systems and applications. We present SysLLMatic, a system that integrates LLMs with profiling-guided feedback and system performance insights to automatically optimize software code. We evaluate it on three benchmark suites: HumanEval_CPP (competitive programming in C++), SciMark2 (scientific kernels in Java), and DaCapoBench (large-scale software systems in Java). Results show that SysLLMatic can improve system performance, including latency, throughput, energy efficiency, memory usage, and CPU utilization. It consistently outperforms state-of-the-art LLM baselines on microbenchmarks. On large-scale application codes, it surpasses traditional compiler optimizations, achieving average relative improvements of 1.85x in latency and 2.24x in throughput. Our findings demonstrate that LLMs, guided by principled systems thinking and appropriate performance diagnostics, can serve as viable software system optimizers. We further identify limitations of our approach and the challenges involved in handling complex applications. This work provides a foundation for generating optimized code across various languages, benchmarks, and program sizes in a principled manner.
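The feedback loop can be sketched roughly as follows: measure, ask the model for a rewrite informed by the profile, and keep the rewrite only if it is both correct and faster. The LLM call is a stub and the harness is a toy; the paper's profiling and performance diagnostics are far richer.

    import time

    def measure(fn, arg, reps: int = 5) -> float:
        t0 = time.perf_counter()
        for _ in range(reps):
            fn(arg)
        return (time.perf_counter() - t0) / reps

    def llm_rewrite(source: str, latency: float) -> str:
        """Stub: a real system would prompt an LLM with code + profile data."""
        return source  # identity rewrite keeps this sketch runnable

    SOURCE = "def total(n):\n    return sum(range(n))\n"

    def optimize(source: str, arg: int, rounds: int = 3) -> str:
        ns = {}
        exec(source, ns)
        best_src, best_fn = source, ns["total"]
        best_t, expected = measure(best_fn, arg), best_fn(arg)
        for _ in range(rounds):
            cand_src = llm_rewrite(best_src, best_t)
            ns = {}
            exec(cand_src, ns)
            cand_fn = ns["total"]
            if cand_fn(arg) == expected:      # correctness gate
                t = measure(cand_fn, arg)
                if t < best_t:                # keep only real speedups
                    best_src, best_t = cand_src, t
        return best_src

    print(optimize(SOURCE, 10_000))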

[15] arXiv:2506.01342 [pdf, html, other]
Title: An Accurate and Efficient Vulnerability Propagation Analysis Framework
Bonan Ruan, Zhiwei Lin, Jiahao Liu, Chuqi Zhang, Kaihang Ji, Zhenkai Liang
Subjects: Software Engineering (cs.SE); Cryptography and Security (cs.CR)

Identifying the impact scope and scale is critical for software supply chain vulnerability assessment. However, existing studies face substantial limitations. First, prior studies either work at coarse package-level granularity, producing many false positives, or fail to accomplish whole-ecosystem vulnerability propagation analysis. Second, although vulnerability assessment indicators like CVSS characterize individual vulnerabilities, no metric exists to specifically quantify the dynamic impact of vulnerability propagation across software supply chains. To address these limitations and enable accurate and comprehensive vulnerability impact assessment, we propose a novel approach: (i) a hierarchical worklist-based algorithm for whole-ecosystem and call-graph-level vulnerability propagation analysis and (ii) the Vulnerability Propagation Scoring System (VPSS), a dynamic metric to quantify the scope and evolution of vulnerability impacts in software supply chains. We implement a prototype of our approach in the Java Maven ecosystem and evaluate it on 100 real-world vulnerabilities. Experimental results demonstrate that our approach enables effective ecosystem-wide vulnerability propagation analysis, and provides a practical, quantitative measure of vulnerability impact through VPSS.
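At its core, call-graph-level propagation can be computed with a worklist over reverse call edges, as in the toy sketch below (the paper's hierarchical, ecosystem-scale algorithm is substantially more involved):

    from collections import deque

    def affected_callers(callers_of: dict, vulnerable: str) -> set:
        """Find all functions that transitively reach a vulnerable function."""
        affected = set()
        worklist = deque([vulnerable])
        while worklist:
            fn = worklist.popleft()
            for caller in callers_of.get(fn, ()):
                if caller not in affected:
                    affected.add(caller)
                    worklist.append(caller)  # propagate one level further up
        return affected

    callers_of = {                       # reverse call edges (toy example)
        "lib.parse": {"app.load", "tool.scan"},
        "app.load": {"app.main"},
    }
    print(affected_callers(callers_of, "lib.parse"))
    # {'app.load', 'tool.scan', 'app.main'}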

[16] arXiv:2506.01427 [pdf, html, other]
Title: Forcrat: Automatic I/O API Translation from C to Rust via Origin and Capability Analysis
Jaemin Hong, Sukyoung Ryu
Comments: 12 pages, 3 figures, 3 tables
Subjects: Software Engineering (cs.SE)

Translating C to Rust is a promising way to enhance the reliability of legacy system programs. Although the industry has developed an automatic C-to-Rust translator, C2Rust, its translation remains unsatisfactory. One major reason is that C2Rust retains C standard library (libc) function calls instead of replacing them with functions from the Rust standard library (Rust std). However, little work has been done on replacing library functions in C2Rust-generated code. In this work, we focus on replacing the I/O API, an important subset of library functions. This poses challenges due to the semantically different designs of I/O APIs in libc and Rust std. First, the two APIs offer different sets of types that represent the origins (e.g., standard input, files) and capabilities (e.g., read, write) of streams used for I/O. Second, they use different error-checking mechanisms: libc uses internal indicators, while Rust std uses return values. To address these challenges, we propose two static analysis techniques, origin and capability analysis and error source analysis, and use their results to replace the I/O API. Our evaluation shows that the proposed approach is (1) correct, with all 32 programs that have test suites passing the tests after transformation, (2) efficient, analyzing and transforming 422k LOC in 14 seconds, and (3) widely applicable, replacing 82% of I/O API calls.

[17] arXiv:2506.01481 [pdf, html, other]
Title: AidAI: Automated Incident Diagnosis for AI Workloads in the Cloud
Yitao Yang, Yangtao Deng, Yifan Xiong, Baochun Li, Hong Xu, Peng Cheng
Subjects: Software Engineering (cs.SE)

AI workloads experience frequent incidents due to intensive hardware utilization and extended training times. The current incident management workflow is provider-centric, where customers report incidents and place the entire troubleshooting responsibility on the infrastructure provider. However, the inherent knowledge gap between customer and provider significantly impacts incident resolution efficiency. In AI infrastructure, incidents may take several days on average to mitigate, resulting in delays and productivity losses. To address these issues, we present AidAI, a customer-centric system that provides immediate incident diagnosis for customers and streamlines the creation of incident tickets for unresolved issues. The key idea of AidAI is to construct internal knowledge bases from historical on-call experiences during the offline phase and mimic the reasoning process of human experts to diagnose incidents through trial and error in the online phase. Evaluations using real-world incident records in Microsoft show that AidAI achieves an average Micro F1 score of 0.854 and Macro F1 score of 0.816 without significant overhead.
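The offline/online split can be pictured with the toy sketch below: historical incidents form a small knowledge base, and diagnosis retrieves the nearest past incident by token overlap. The records and similarity measure are illustrative stand-ins, not AidAI's internals.

    def jaccard(a: str, b: str) -> float:
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / max(len(wa | wb), 1)

    KNOWLEDGE_BASE = [  # mined offline from historical on-call records (toy data)
        {"symptom": "nccl timeout during all-reduce", "fix": "check NIC firmware"},
        {"symptom": "cuda out of memory on rank 0", "fix": "reduce batch size"},
    ]

    def diagnose(incident: str) -> str:
        best = max(KNOWLEDGE_BASE, key=lambda kb: jaccard(incident, kb["symptom"]))
        return best["fix"]

    print(diagnose("all-reduce hangs with nccl timeout"))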

[18] arXiv:2506.01492 [pdf, html, other]
Title: Challenges in designing research infrastructure software in multi-stakeholder contexts
Stephan Druskat, Sabine Theis
Comments: 19 pages, 6 figures, accepted to be presented at 27th International Conference on Human-Computer Interaction, 22-27 June 2025, Gothenburg, Sweden. This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution is published in "HCII 2025", and is available online at this https URL
Journal-ref: Human-Computer Interaction. HCII 2025. Lecture Notes in Computer Science, vol 15767
Subjects: Software Engineering (cs.SE); Human-Computer Interaction (cs.HC)

This study investigates the challenges in designing research infrastructure software for automated software publication in multi-stakeholder environments, focusing specifically on the HERMES system. Through two quantitative surveys of research software engineers (RSEs) and infrastructure facility staff (IFs), it examines technical, organizational, and social requirements across these stakeholder groups. The study reveals significant differences in how RSEs and IFs prioritize various system features. While RSEs highly value compatibility with existing infrastructure, IFs prioritize user-focused aspects like system usability and documentation. The research identifies two main challenges in designing research infrastructure software: (1) the existence of multiple stakeholder groups with differing requirements, and (2) the internal heterogeneity within each stakeholder group across dimensions such as technical experience. The study also highlights that only half of RSE respondents actively practice software publication, pointing to potential cultural or technical barriers. Additionally, the research reveals discrepancies in how stakeholders view organizational aspects, with IFs consistently rating factors like responsibility structures and quality assurance as more important than RSEs do. These findings contribute to a better understanding of the complexities involved in designing research infrastructure software and emphasize the need for systems that can accommodate diverse user groups while maintaining usability across different technical expertise levels.

[19] arXiv:2506.01560 [pdf, other]
Title: SPAC: A Python Package for Spatial Single-Cell Analysis of Multiplex Imaging
Fang Liu, Rui He, Andrei Bombin, Ahmad B. Abdallah, Omar Eldaghar, Tommy R. Sheeley, Sam E. Ying, George Zaki
Comments: 7 pages, 1 figure; pre-print submitted to the *Journal of Open Source Software (JOSS)*
Subjects: Software Engineering (cs.SE); Genomics (q-bio.GN)

Multiplexed immunofluorescence microscopy captures detailed measurements of spatially resolved, multiple biomarkers simultaneously, revealing tissue composition and cellular interactions in situ among single cells. The growing scale and dimensional complexity of these datasets demand reproducible, comprehensive and user-friendly computational tools. To address this need, we developed SPAC (SPAtial single-Cell analysis), a Python-based package and a corresponding shiny application within an integrated, modular SPAC ecosystem (Liu et al., 2025) designed specifically for biologists without extensive coding expertise. Following image segmentation and extraction of spatially resolved single-cell data, SPAC streamlines downstream phenotyping and spatial analysis, facilitating characterization of cellular heterogeneity and spatial organization within tissues. Through scalable performance, specialized spatial statistics, highly customizable visualizations, and seamless workflows from dataset to insights, SPAC significantly lowers barriers to sophisticated spatial analyses.

[20] arXiv:2506.01595 [pdf, other]
Title: A Systematic Mapping Study on Software Architecture for AI-based Mobility Systems
Amra Ramic, Stefan Kugele
Comments: 8 pages, 10 figures, 4 tables
Subjects: Software Engineering (cs.SE)

Background: Due to their diversity, complexity, and above all importance, safety-critical and dependable systems must be developed with special diligence. Criticality increases when these systems contain artificial intelligence (AI) components, which are known for their uncertainty. As software and reference architectures form the backbone of any successful system, including safety-critical dependable systems with learning-enabled components, choosing a suitable architecture that guarantees safety despite uncertainties is of great importance.
Aim: We aim to provide the missing overview of all existing architectures, their contribution to safety, and their level of maturity in AI-based safety-critical systems.
Method: To achieve this aim, we report a systematic mapping study. From a set of 1,639 primary studies, we selected 38 relevant studies dealing with safety assurance through software architecture in AI-based safety-critical systems. The selected studies were then examined using various criteria to answer the research questions and identify gaps in this area of research.
Results: Our findings showed which architectures have been proposed and to what extent they have been implemented. Furthermore, we identified gaps in different application areas of those systems and explained these gaps with various arguments.
Conclusion: As the AI trend continues to grow, system complexity will inevitably increase, too. To ensure the lasting safety of these systems, we provide an overview of the state of the art, intending to identify best practices and research gaps and to give future research a clearer focus.

[21] arXiv:2506.01604 [pdf, html, other]
Title: Exploring Prompt Patterns in AI-Assisted Code Generation: Towards Faster and More Effective Developer-AI Collaboration
Sophia DiCuffa, Amanda Zambrana, Priyanshi Yadav, Sashidhar Madiraju, Khushi Suman, Eman Abdullah AlOmar
Subjects: Software Engineering (cs.SE)

The growing integration of AI tools in software development, particularly Large Language Models (LLMs) such as ChatGPT, has revolutionized how developers approach coding tasks. However, achieving high-quality code often requires iterative interactions, which can be time-consuming and inefficient. This paper explores the application of structured prompt patterns to minimize the number of interactions required for satisfactory AI-assisted code generation. Using the DevGPT dataset, we analyzed seven distinct prompt patterns to evaluate their effectiveness in reducing back-and-forth communication between developers and AI. Our findings highlight patterns such as "Context and Instruction" and "Recipe" as particularly effective in achieving high-quality outputs with minimal iterations. The study emphasizes the potential for prompt engineering to streamline developer-AI collaboration, providing practical insights into crafting prompts that balance precision, efficiency, and clarity.
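For concreteness, here is what such patterns might look like as prompt templates; the pattern names come from the paper, while the wording below is an illustrative assumption.

    # "Context and Instruction": state the project context, then a precise task.
    CONTEXT_AND_INSTRUCTION = """\
    Context: We maintain a Flask REST API backed by SQLAlchemy models.
    Instruction: Write a paginated GET endpoint for /users that returns JSON
    and responds with HTTP 400 for invalid page numbers.
    """

    # "Recipe": spell out the steps the answer should follow.
    RECIPE = """\
    I want to add caching to a Python function. Proceed as follows:
    1. Name the standard-library tool that applies.
    2. Show the decorated function.
    3. Show how to clear the cache in tests.
    """

    print(CONTEXT_AND_INSTRUCTION)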

[22] arXiv:2506.01774 [pdf, html, other]
Title: Greening AI-enabled Systems with Software Engineering: A Research Agenda for Environmentally Sustainable AI Practices
Luís Cruz, João Paulo Fernandes, Maja H. Kirkeby, Silverio Martínez-Fernández, June Sallou, Hina Anwar, Enrique Barba Roque, Justus Bogner, Joel Castaño, Fernando Castor, Aadil Chasmawala, Simão Cunha, Daniel Feitosa, Alexandra González, Andreas Jedlitschka, Patricia Lago, Ana Oprescu, Pooja Rani, João Saraiva, Federica Sarro, Raghavendra Selvan, Karthik Vaidhyanathan, Roberto Verdecchia, Ivan P. Yamshchikov, Henry Muccini
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

The environmental impact of Artificial Intelligence (AI)-enabled systems is increasing rapidly, and software engineering plays a critical role in developing sustainable solutions. The "Greening AI with Software Engineering" CECAM-Lorentz workshop (no. 1358, 2025) funded by the Centre Européen de Calcul Atomique et Moléculaire and the Lorentz Center, provided an interdisciplinary forum for 29 participants, from practitioners to academics, to share knowledge, ideas, practices, and current results dedicated to advancing green software and AI research. The workshop was held February 3-7, 2025, in Lausanne, Switzerland. Through keynotes, flash talks, and collaborative discussions, participants identified and prioritized key challenges for the field. These included energy assessment and standardization, benchmarking practices, sustainability-aware architectures, runtime adaptation, empirical methodologies, and education. This report presents a research agenda emerging from the workshop, outlining open research directions and practical recommendations to guide the development of environmentally sustainable AI-enabled systems rooted in software engineering principles.

[23] arXiv:2506.01875 [pdf, other]
Title: Dynamic Software Updating in Java - Comparing Concepts and Resource Demands
Danijel Mlinaric, Vedran Mornar
Subjects: Software Engineering (cs.SE)

Dynamic software updating (DSU) is an extremely useful feature to be used during software evolution. It can be used to reduce downtime costs, for security enhancements, and for profiling and testing new functionalities. There is substantial research on dynamic software updating addressing the diverse problems introduced by the topic, but there is a lack of research comparing the various approaches with respect to supported changes and resource demands. In this paper, we compare currently available concepts for the Java programming language that deal with dynamically applied changes, and assess the impact of those changes on computer resource demands.

Cross submissions (showing 11 of 11 entries)

[24] arXiv:2506.00204 (cross-list from cs.CL) [pdf, html, other]
Title: Structure-Aware Fill-in-the-Middle Pretraining for Code
Linyuan Gong, Alvin Cheung, Mostafa Elhoushi, Sida Wang
Comments: 14 pages
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

Fill-in-the-Middle (FIM) is a common pretraining method for code LLMs, where models complete code segments given surrounding context. However, existing LLMs treat code as plain text and mask random character spans. We propose and evaluate AST-FIM, a pretraining strategy that leverages Abstract Syntax Trees (ASTs) to mask complete syntactic structures at scale, ensuring coherent training examples better aligned with universal code structures and common code editing patterns such as blocks, expressions, or functions. To evaluate real-world FIM programming tasks, we introduce Real-FIM-Eval, a benchmark derived from 30,000+ GitHub commits across 12 languages. On infilling tasks, experiments on 1B and 8B parameter models show that AST-FIM is particularly beneficial for real-world code editing, as it outperforms standard random-character FIM by up to 5 points on standard FIM benchmarks. Our code is publicly available at this https URL.
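A minimal sketch of the masking idea, using Python's ast module to carve out one complete function as the "middle" span; the sentinel layout is illustrative, and real pretraining uses tokenizer-specific FIM tokens.

    import ast
    import textwrap

    SOURCE = textwrap.dedent("""\
        def area(r):
            return 3.14159 * r * r

        def perimeter(r):
            return 2 * 3.14159 * r
        """)

    def ast_fim_split(source: str):
        tree = ast.parse(source)
        node = next(n for n in tree.body if isinstance(n, ast.FunctionDef))
        lines = source.splitlines(keepends=True)
        start, end = node.lineno - 1, node.end_lineno  # whole-function span
        return ("".join(lines[:start]),      # prefix
                "".join(lines[start:end]),   # middle: a complete syntactic unit
                "".join(lines[end:]))        # suffix

    prefix, middle, suffix = ast_fim_split(SOURCE)
    print(f"<PRE>{prefix}<SUF>{suffix}<MID>{middle}")  # FIM training layout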

[25] arXiv:2506.00282 (cross-list from cs.GT) [pdf, html, other]
Title: Shill Bidding Prevention in Decentralized Auctions Using Smart Contracts
M.A. Bouaicha, G. Destefanis, T. Montanaro, N. Lasla, L. Patrono
Subjects: Computer Science and Game Theory (cs.GT); Cryptography and Security (cs.CR); Software Engineering (cs.SE)

In online auctions, fraudulent behaviors such as shill bidding pose significant risks. This paper presents a conceptual framework that applies dynamic, behavior-based penalties to deter auction fraud using blockchain smart contracts. Unlike traditional post-auction detection methods, this approach prevents manipulation in real time by introducing an economic disincentive system where penalty severity scales with suspicious bidding patterns. The framework employs the proposed Bid Shill Score (BSS) to evaluate nine distinct bidding behaviors, dynamically adjusting penalty fees to make fraudulent activity financially unaffordable while preserving fair competition.
The system is implemented within a decentralized English auction on the Ethereum blockchain, demonstrating how smart contracts enforce transparent auction rules without trusted intermediaries. Simulations confirm the effectiveness of the proposed model: the dynamic penalty mechanism reduces the profitability of shill bidding while keeping penalties low for honest bidders. Performance evaluation shows that the system introduces only moderate gas and latency overhead, keeping transaction costs and response times within practical bounds for real-world use. The approach provides a practical method for behavior-based fraud prevention in decentralized systems where trust cannot be assumed.
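A toy version of the behavior-scaled penalty: the fee grows with a suspicion score in [0, 1]. The quadratic scaling and the 10x cap are assumptions; the paper's Bid Shill Score aggregates nine behavioral signals.

    def penalty_fee(base_fee: float, bss: float, max_multiplier: float = 10.0) -> float:
        """Honest bidders (low BSS) pay ~base_fee; suspicious ones pay up to 10x."""
        assert 0.0 <= bss <= 1.0
        return base_fee * (1.0 + (max_multiplier - 1.0) * bss ** 2)

    for score in (0.0, 0.3, 0.9):
        print(f"BSS={score:.1f} -> fee {penalty_fee(0.01, score):.4f} ETH")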

[26] arXiv:2506.00419 (cross-list from cs.CR) [pdf, html, other]
Title: Teaching an Old LLM Secure Coding: Localized Preference Optimization on Distilled Preferences
Mohammad Saqib, Saikat Chakraborty, Santu Karmaker, Niranjan Balasubramanian
Comments: Accepted to ACL 2025 (Main)
Subjects: Cryptography and Security (cs.CR); Software Engineering (cs.SE)

LLM generated code often contains security issues. We address two key challenges in improving secure code generation. First, obtaining high quality training data covering a broad set of security issues is critical. To address this, we introduce a method for distilling a preference dataset of insecure and secure code pairs from frontier LLMs, along with a security reasoning that explains the issues and the fix. The key idea here is to make use of security knowledge sources to devise a systematic prompting strategy that ensures broad coverage. Second, aligning models to secure code requires focusing on localized regions of code. Direct preference optimization methods, like SimPO, are not designed to handle these localized differences and turn out to be ineffective. We address this with a new localized preference optimization algorithm that masks the security related tokens in both the winning (secure) and losing (insecure) responses. To prevent loss in code quality, we also add a regularizer. Evaluations show that both training on our dataset, DiSCo, and the new preference optimization algorithm, LPO, yield substantial reductions in code insecurity while also improving overall code quality. Code and dataset are available at this https URL.
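One way to picture a localized preference objective: compute a SimPO-style margin only over security-relevant tokens and regularize the rest, as in the PyTorch sketch below. The shapes, the source of the mask, and the exact loss are assumptions, not the paper's LPO.

    import torch
    import torch.nn.functional as F

    def localized_pref_loss(logp_w, logp_l, sec_mask, beta=2.0, lam=0.1):
        """logp_w / logp_l: per-token log-probs of secure / insecure responses.
        sec_mask: 1.0 on security-related tokens, 0.0 elsewhere."""
        denom = sec_mask.sum(-1).clamp(min=1.0)
        margin = ((logp_w - logp_l) * sec_mask).sum(-1) / denom  # localized margin
        pref = -F.logsigmoid(beta * margin).mean()
        # Regularizer: keep likelihood of non-security tokens from degrading.
        reg = -((logp_w * (1.0 - sec_mask)).sum(-1) / logp_w.size(-1)).mean()
        return pref + lam * reg

    logp_w = torch.randn(4, 32)  # batch of 4 responses, 32 tokens each
    logp_l = torch.randn(4, 32)
    sec_mask = (torch.rand(4, 32) > 0.7).float()
    print(localized_pref_loss(logp_w, logp_l, sec_mask))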

[27] arXiv:2506.00645 (cross-list from cs.RO) [pdf, html, other]
Title: AWML: An Open-Source ML-based Robotics Perception Framework to Deploy for ROS-based Autonomous Driving Software
Satoshi Tanaka, Samrat Thapa, Kok Seang Tan, Amadeusz Szymko, Lobos Kenzo, Koji Minoda, Shintaro Tomie, Kotaro Uetake, Guolong Zhang, Isamu Yamashita, Takamasa Horibe
Comments: 17 pages, 9 figures
Subjects: Robotics (cs.RO); Software Engineering (cs.SE)

In recent years, machine learning technologies have played an important role in robotics, particularly in the development of autonomous robots and self-driving vehicles. As the industry matures, robotics frameworks like ROS 2 have been developed and provide a broad range of applications from research to production. In this work, we introduce AWML, a framework designed to support MLOps for robotics. AWML provides a machine learning infrastructure for autonomous driving, supporting not only the deployment of trained models to robotic systems, but also an active learning pipeline that incorporates auto-labeling, semi-auto-labeling, and data mining techniques.

[28] arXiv:2506.00790 (cross-list from cs.CR) [pdf, html, other]
Title: Assessing and Enhancing Quantum Readiness in Mobile Apps
Joseph Strauss, Krishna Upadhyay, A.B. Siddique, Ibrahim Baggili, Umar Farooq
Comments: 2 pages, 2 figures, 1 table. 46th IEEE Symposium on Security and Privacy (Poster Track), 2025
Subjects: Cryptography and Security (cs.CR); Software Engineering (cs.SE)

Quantum computers threaten widely deployed cryptographic primitives such as RSA, DSA, and ECC. While NIST has released post-quantum cryptographic (PQC) standards (e.g., Kyber, Dilithium), mobile app ecosystems remain largely unprepared for this transition. We present a large-scale binary analysis of over 4,000 Android apps to assess cryptographic readiness. Our results show widespread reliance on quantum-vulnerable algorithms such as MD5, SHA-1, and RSA, while PQC adoption remains absent in production apps. To bridge the readiness gap, we explore LLM-assisted migration. We evaluate leading LLMs (GPT-4o, Gemini Flash, Claude Sonnet, etc.) for automated cryptographic migration. All models successfully performed simple hash replacements (e.g., SHA-1 to SHA-256). However, none produced correct PQC upgrades due to multi-file changes, missing imports, and lack of context awareness. These results underscore the need for structured guidance and system-aware tooling for post-quantum migration.

[29] arXiv:2506.01056 (cross-list from cs.AI) [pdf, html, other]
Title: MCP-Zero: Proactive Toolchain Construction for LLM Agents from Scratch
Xiang Fei, Xiawu Zheng, Hao Feng
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

Function-calling has enabled large language models (LLMs) to act as tool-using agents, but injecting thousands of tool schemas into the prompt is costly and error-prone. We introduce MCP-Zero, a proactive agent framework that lets the LLM itself decide when and which external tools to retrieve, thereby assembling a task-specific toolchain from scratch. The framework is built upon three components: (1) Proactive Tool Request, where the model emits a structured <tool_assistant> block that explicitly specifies the desired server and task; (2) Hierarchical Vector Routing, a coarse-to-fine retrieval algorithm that first selects candidate servers and then ranks tools within each server based on semantic similarity; (3) Iterative Proactive Invocation, enabling multi-round, cross-domain toolchain construction with minimal context overhead, and allowing the model to iteratively revise its request when the returned tools are insufficient. To evaluate our approach we also compile MCP-tools, a retrieval dataset comprising 308 MCP servers and 2,797 tools extracted from the official Model-Context-Protocol repository and normalized into a unified JSON schema. Experiments show that MCP-Zero (i) effectively addresses the context overhead problem of existing methods and accurately selects the correct tool from a pool of nearly 3,000 candidates (248.1k tokens); (ii) reduces token consumption by 98% on the APIbank while maintaining high accuracy; and (iii) supports multi-turn tool invocation with consistent accuracy across rounds. The code and dataset will be released soon.
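The hierarchical routing step might look like the sketch below: cosine similarity first selects a server, then ranks that server's tools. The random "embeddings" are stand-ins; a real system would use a text-embedding model.

    import numpy as np

    rng = np.random.default_rng(0)
    embed = lambda text: rng.standard_normal(64)  # stand-in embedding function

    servers = {
        "filesystem": {"read_file": embed("read"), "write_file": embed("write")},
        "browser": {"open_url": embed("browse"), "click": embed("click")},
    }
    server_vecs = {name: np.mean(list(tools.values()), axis=0)
                   for name, tools in servers.items()}

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def route(request: str, top_k: int = 2):
        q = embed(request)
        best = max(server_vecs, key=lambda s: cos(q, server_vecs[s]))  # coarse
        ranked = sorted(servers[best],
                        key=lambda t: cos(q, servers[best][t]),
                        reverse=True)                                  # fine
        return best, ranked[:top_k]

    print(route("save the report to disk"))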

[30] arXiv:2506.01074 (cross-list from cs.CL) [pdf, html, other]
Title: How Programming Concepts and Neurons Are Shared in Code Language Models
Amir Hossein Kargaran, Yihong Liu, François Yvon, Hinrich Schütze
Comments: ACL Findings 2025
Subjects: Computation and Language (cs.CL); Programming Languages (cs.PL); Software Engineering (cs.SE)

Several studies have explored the mechanisms of large language models (LLMs) in coding tasks, but most have focused on programming languages (PLs) in a monolingual setting. In this paper, we investigate the relationship between multiple PLs and English in the concept space of LLMs. We perform a few-shot translation task on 21 PL pairs using two Llama-based models. By decoding the embeddings of intermediate layers during this task, we observe that the concept space is closer to English (including PL keywords) and assigns high probabilities to English tokens in the second half of the intermediate layers. We analyze neuron activations for 11 PLs and English, finding that while language-specific neurons are primarily concentrated in the bottom layers, those exclusive to each PL tend to appear in the top layers. For PLs that are highly aligned with multiple other PLs, identifying language-specific neurons is not feasible. These PLs also tend to have a larger keyword set than other PLs and are closer to the model's concept space regardless of the input/output PL in the translation task. Our findings provide insights into how LLMs internally represent PLs, revealing structural patterns in the model's concept space. Code is available at this https URL.

[31] arXiv:2506.01423 (cross-list from cs.AI) [pdf, html, other]
Title: FinRobot: Generative Business Process AI Agents for Enterprise Resource Planning in Finance
Hongyang Yang, Likun Lin, Yang She, Xinyu Liao, Jiaoyang Wang, Runjia Zhang, Yuquan Mo, Christina Dan Wang
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE); General Finance (q-fin.GN)

Enterprise Resource Planning (ERP) systems serve as the digital backbone of modern financial institutions, yet they continue to rely on static, rule-based workflows that limit adaptability, scalability, and intelligence. As business operations grow more complex and data-rich, conventional ERP platforms struggle to integrate structured and unstructured data in real time and to accommodate dynamic, cross-functional workflows.
In this paper, we present the first AI-native, agent-based framework for ERP systems, introducing a novel architecture of Generative Business Process AI Agents (GBPAs) that bring autonomy, reasoning, and dynamic optimization to enterprise workflows. The proposed system integrates generative AI with business process modeling and multi-agent orchestration, enabling end-to-end automation of complex tasks such as budget planning, financial reporting, and wire transfer processing. Unlike traditional workflow engines, GBPAs interpret user intent, synthesize workflows in real time, and coordinate specialized sub-agents for modular task execution. We validate the framework through case studies in bank wire transfers and employee reimbursements, two representative financial workflows with distinct complexity and data modalities. Results show that GBPAs achieve up to 40% reduction in processing time, 94% drop in error rate, and improved regulatory compliance by enabling parallelism, risk control insertion, and semantic reasoning. These findings highlight the potential of GBPAs to bridge the gap between generative AI capabilities and enterprise-grade automation, laying the groundwork for the next generation of intelligent ERP systems.

[32] arXiv:2506.01631 (cross-list from cs.LG) [pdf, html, other]
Title: Gradient-Based Model Fingerprinting for LLM Similarity Detection and Family Classification
Zehao Wu, Yanjie Zhao, Haoyu Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

As Large Language Models (LLMs) become integral software components in modern applications, unauthorized model derivations through fine-tuning, merging, and redistribution have emerged as critical software engineering challenges. Unlike traditional software where clone detection and license compliance are well-established, the LLM ecosystem lacks effective mechanisms to detect model lineage and enforce licensing agreements. This gap is particularly problematic when open-source model creators, such as Meta's LLaMA, require derivative works to maintain naming conventions for attribution, yet no technical means exist to verify compliance.
To fill this gap, treating LLMs as software artifacts requiring provenance tracking, we present TensorGuard, a gradient-based fingerprinting framework for LLM similarity detection and family classification. Our approach extracts model-intrinsic behavioral signatures by analyzing gradient responses to random input perturbations across tensor layers, operating independently of training data, watermarks, or specific model formats. TensorGuard supports the widely-adopted safetensors format and constructs high-dimensional fingerprints through statistical analysis of gradient features. These fingerprints enable two complementary capabilities: direct pairwise similarity assessment between arbitrary models through distance computation, and systematic family classification of unknown models via the K-Means clustering algorithm with domain-informed centroid initialization using known base models. Experimental evaluation on 58 models comprising 8 base models and 50 derivatives across five model families (Llama, Qwen, Gemma, Phi, Mistral) demonstrates 94% classification accuracy under our centroid-initialized K-Means clustering.

[33] arXiv:2506.01770 (cross-list from cs.CR) [pdf, html, other]
Title: ReGA: Representation-Guided Abstraction for Model-based Safeguarding of LLMs
Zeming Wei, Chengcan Wu, Meng Sun
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)

Large Language Models (LLMs) have achieved significant success in various tasks, yet concerns about their safety and security have emerged. In particular, they pose risks in generating harmful content and vulnerability to jailbreaking attacks. To analyze and monitor machine learning models, model-based analysis has demonstrated notable potential in stateful deep neural networks, yet suffers from scalability issues when extending to LLMs due to their vast feature spaces. In this paper, we propose ReGA, a model-based analysis framework with representation-guided abstraction, to safeguard LLMs against harmful prompts and generations. By leveraging safety-critical representations, which are low-dimensional directions emerging in hidden states that indicate safety-related concepts, ReGA effectively addresses the scalability issue when constructing the abstract model for safety modeling. Our comprehensive evaluation shows that ReGA performs sufficiently well in distinguishing between safe and harmful inputs, achieving an AUROC of 0.975 at the prompt level and 0.985 at the conversation level. Additionally, ReGA exhibits robustness to real-world attacks and generalization across different safety perspectives, outperforming existing safeguard paradigms in terms of interpretability and scalability. Overall, ReGA serves as an efficient and scalable solution to enhance LLM safety by integrating representation engineering with model-based abstraction, paving the way for new paradigms to utilize software insights for AI safety. Our code is available at this https URL.

[34] arXiv:2506.01825 (cross-list from cs.CR) [pdf, html, other]
Title: Which Factors Make Code LLMs More Vulnerable to Backdoor Attacks? A Systematic Study
Chenyu Wang, Zhou Yang, Yaniv Harel, David Lo
Subjects: Cryptography and Security (cs.CR); Software Engineering (cs.SE)

Code LLMs are increasingly employed in software development. However, studies have shown that they are vulnerable to backdoor attacks: when a trigger (a specific input pattern) appears in the input, the backdoor will be activated and cause the model to generate malicious outputs. Researchers have designed various triggers and demonstrated the feasibility of implanting backdoors by poisoning a fraction of the training data. Some basic conclusions have been made, such as backdoors becoming easier to implant when more training data are modified. However, existing research has not explored other factors influencing backdoor attacks on Code LLMs, such as training batch size, epoch number, and the broader design space for triggers, e.g., trigger length.
To bridge this gap, we use code summarization as an example to perform an empirical study that systematically investigates the factors affecting backdoor effectiveness and assesses the extent of the threat posed. Three categories of factors are considered: data, model, and inference, revealing previously overlooked findings. We find that the prevailing consensus -- that attacks are ineffective at extremely low poisoning rates -- is incorrect. The absolute number of poisoned samples matters as well. Specifically, poisoning just 20 out of 454K samples (0.004% poisoning rate -- far below the minimum setting of 0.1% in prior studies) successfully implants backdoors! Moreover, the common defense is incapable of removing even a single poisoned sample from it. Additionally, small batch sizes increase the risk of backdoor attacks. We also uncover other critical factors such as trigger types, trigger length, and the rarity of tokens in the triggers, leading to valuable insights for assessing Code LLMs' vulnerability to backdoor attacks. Our study highlights the urgent need for defense mechanisms against extremely low poisoning rate settings.

Replacement submissions (showing 25 of 25 entries)

[35] arXiv:2310.14843 (replaced) [pdf, html, other]
Title: NoCodeGPT: A No-Code Interface for Building Web Apps with Language Models
Mauricio Monteiro, Bruno Castelo Branco, Samuel Silvestre, Guilherme Avelino, Marco Tulio Valente
Comments: Accepted at Software: Practice and Experience. Open access version available at: this https URL
Subjects: Software Engineering (cs.SE)

In this paper, we first report an exploratory study where three participants were instructed to use ChatGPT to implement a simple Web-based application. A key finding of this study revealed that ChatGPT does not offer a user-friendly interface for building applications, even small web systems. For example, one participant with limited experience in software development was unable to complete any of the proposed user stories. Then, and as the primary contribution of this work, we decided to design, implement, and evaluate a tool that offers a customized interface for language models like GPT, specifically targeting the implementation of small web applications without writing code. This tool, called NoCodeGPT, instruments the prompts sent to the language model with useful contextual information (e.g., the files that need to be modified when the user identifies and requests a bug fix). It also saves the files generated by the language model in the correct directories. Additionally, a simple version control feature is offered, allowing users to quickly revert to a previous version of the code when the model begins to hallucinate and generates worthless results. To evaluate our tool, we invited 14 students with limited Web development experience to implement two small web applications using only prompts and NoCodeGPT. Overall, the results of this evaluation were quite satisfactory and significantly better than those of the initial study (the one using the standard ChatGPT interface). More than half of the participants (9 out of 14) successfully completed the proposed applications, while the others completed at least half of the proposed user stories.
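
A minimal sketch of the kind of snapshot-based version control the tool describes, assuming a simple copy-and-revert scheme over the generated project directory; the function names and layout are hypothetical:

```python
import shutil
import time
from pathlib import Path

def snapshot(workdir: Path, history: Path) -> Path:
    """Copy the current generated project into a timestamped snapshot
    so the user can revert if the model starts hallucinating."""
    dest = history / time.strftime("v%Y%m%d-%H%M%S")
    shutil.copytree(workdir, dest)
    return dest

def revert(workdir: Path, snapshot_dir: Path) -> None:
    """Replace the working copy with a previously saved snapshot."""
    shutil.rmtree(workdir)
    shutil.copytree(snapshot_dir, workdir)
```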

[36] arXiv:2401.16310 (replaced) [pdf, html, other]
Title: An Insight into Security Code Review with LLMs: Capabilities, Obstacles, and Influential Factors
Jiaxin Yu, Peng Liang, Yujia Fu, Amjed Tahir, Mojtaba Shahin, Chong Wang, Yangxiao Cai
Comments: 21 pages, 5 images, 8 tables, Manuscript submitted to a journal (2025)
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Security code review is a time-consuming and labor-intensive process typically requiring integration with automated security defect detection tools. However, existing security analysis tools struggle with poor generalization, high false positive rates, and coarse detection granularity. Large Language Models (LLMs) have been considered promising candidates for addressing those challenges. We conducted an empirical study to explore the potential of LLMs in detecting security defects during code review. Specifically, we evaluated the performance of six LLMs under five different prompts and compared them with state-of-the-art static analysis tools. We also performed linguistic and regression analyses for the best-performing LLM to identify quality problems in its responses and factors influencing its performance. Our findings show that: (1) existing pre-trained LLMs have limited capability in security code review but significantly outperform the state-of-the-art static analysis tools; (2) GPT-4 performs best among all LLMs when provided with a CWE list for reference; (3) GPT-4 frequently generates verbose responses or responses that do not comply with the task requirements given in the prompts; (4) GPT-4 is more adept at identifying security defects in code files with fewer tokens, containing functional logic, or written by developers with less involvement in the project.

[37] arXiv:2404.10304 (replaced) [pdf, html, other]
Title: LLM-Powered Test Case Generation for Detecting Bugs in Plausible Programs
Kaibo Liu, Zhenpeng Chen, Yiyang Liu, Jie M. Zhang, Mark Harman, Yudong Han, Yun Ma, Yihong Dong, Ge Li, Gang Huang
Comments: Accepted by the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025) Main Track
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)

Detecting tricky bugs in plausible programs, those that pass existing test suites yet still contain bugs, remains a significant challenge in software testing. To address this problem, we propose TrickCatcher, an LLM-powered approach to generating test cases for uncovering bugs in plausible programs. TrickCatcher operates in three stages: First, it uses an LLM to generate program variants based on the program under test (PUT) and its specification. Second, it employs an LLM to construct an input generator from the specification for producing test inputs. Finally, these inputs are executed on both the PUT and its program variants to detect inconsistencies in their outputs. We evaluate TrickCatcher on two datasets, TrickyBugs and EvalPlus, which include 366 human-written and 151 AI-generated plausible programs with tricky bugs. TrickCatcher achieves recall, precision, and F1 scores that are 1.80x, 2.65x, and 1.66x those of the state-of-the-art baselines, respectively. Code and data used are available at this https URL.
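
The core of the third stage is plain differential testing; below is a minimal sketch under hypothetical callables for the PUT, its variants, and the input generator:

```python
def differential_test(put, variants, input_gen, n_inputs=100):
    """Run the program under test (PUT) and its LLM-generated variants
    on generated inputs; any output disagreement flags a candidate bug."""
    suspicious = []
    for _ in range(n_inputs):
        x = input_gen()
        expected = put(x)
        if any(variant(x) != expected for variant in variants):
            suspicious.append(x)  # inconsistent outputs: inspect manually
    return suspicious
```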

[38] arXiv:2404.19318 (replaced) [pdf, html, other]
Title: Calibration of Large Language Models on Code Summarization
Yuvraj Virk, Premkumar Devanbu, Toufique Ahmed
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)

A brief, fluent, and relevant summary can be helpful during program comprehension; however, such a summary does require significant human effort to produce. Often, good summaries are unavailable in software projects, which makes maintenance more difficult. There has been a considerable body of research into automated AI-based methods, using Large Language Models (LLMs), to generate summaries of code; there has also been quite a bit of work on ways to measure the performance of such summarization methods, with special attention paid to how closely these AI-generated summaries resemble a summary a human might have produced. Measures such as BERTScore and BLEU have been suggested and evaluated with human-subject studies.
However, LLM-generated summaries can be inaccurate, incomplete, etc.: generally, too dissimilar to one that a good developer might write. Given an LLM-generated code summary, how can a user rationally judge whether it is sufficiently good and reliable? Given just some input source code and an LLM-generated summary, existing approaches can help judge the brevity, fluency, and relevance of the summary; however, it is difficult to gauge whether an LLM-generated summary sufficiently resembles what a human might produce without a "golden" human-produced summary to compare against. We study this resemblance question as a calibration problem: given just the code and the summary from an LLM, can we compute a confidence measure that provides a reliable indication of whether the summary sufficiently resembles what a human would have produced in this situation? We examine this question using several LLMs, for several languages, and in several different settings. Our investigation suggests approaches to provide reliable predictions of the likelihood that an LLM-generated summary would sufficiently resemble a summary a human might write for the same code.
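
As a concrete example of a calibration signal, one simple confidence measure is the geometric mean token probability of the generated summary; this is a generic stand-in for illustration, not necessarily the paper's measure:

```python
import math

def summary_confidence(token_logprobs):
    """Geometric-mean per-token probability of a generated summary:
    a standard calibration signal computed from just the LLM's own
    token log-probabilities (no reference summary needed)."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# A reliability check then buckets summaries by confidence and compares
# each bucket's mean confidence to its empirical human-similarity rate.
```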

[39] arXiv:2405.04860 (replaced) [pdf, other]
Title: Quantum Concolic Testing
Shangzhou Xia, Jianjun Zhao, Fuyuan Zhang, Xiaoyu Guo
Subjects: Software Engineering (cs.SE); Quantum Physics (quant-ph)

This paper presents the first concolic testing framework explicitly designed for quantum programs. The framework introduces quantum constraint generation methods for quantum control statements that quantify quantum states and offers a symbolization method for quantum variables. Based on this framework, we generate path constraints for each concrete execution path of a quantum program. These constraints guide the exploration of new paths, with a quantum constraint solver determining outcomes to create novel input samples, thereby enhancing branch coverage. Our framework has been implemented in Python and integrated with Qiskit for practical evaluation. Experimental results show that our concolic testing framework improves branch coverage, generates high-quality quantum input samples, and detects bugs, demonstrating its effectiveness and efficiency in quantum programming and bug detection. Regarding branch coverage, our framework achieves more than 74.27% on quantum programs with fewer than 5 qubits.

[40] arXiv:2405.17492 (replaced) [pdf, html, other]
Title: StatWhy: Formal Verification Tool for Statistical Hypothesis Testing Programs
Yusuke Kawamoto, Kentaro Kobayashi, Kohei Suenaga
Comments: Accepted to CAV 2025 (the 37th International Conference on Computer Aided Verification)
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Statistical methods have been widely misused and misinterpreted in various scientific fields, raising significant concerns about the integrity of scientific research. To mitigate this problem, we propose a tool-assisted method for formally specifying and automatically verifying the correctness of statistical programs. In this method, programmers are required to annotate the source code of the statistical programs with the requirements for these methods. Through this annotation, they are reminded to check the requirements for statistical methods, including those that cannot be formally verified, such as the distribution of the unknown true population. Our software tool StatWhy automatically checks whether programmers have properly specified the requirements for the statistical methods, thereby identifying any missing requirements that need to be addressed. This tool is implemented using the Why3 platform to verify the correctness of OCaml programs that conduct statistical hypothesis testing. We demonstrate how StatWhy can be used to avoid common errors in various statistical hypothesis testing programs.

[41] arXiv:2409.04588 (replaced) [pdf, html, other]
Title: PVAC: Package Version Activity Categorizer, Leveraging Semantic Versioning in a Heterogeneous System
Shane K. Panter, Luke Hindman, Nasir U. Eisty
Comments: Accepted to Empirical Software Engineering
Journal-ref: Empir Software Eng 30, 118 (2025)
Subjects: Software Engineering (cs.SE)

Context: Modern open-source software ecosystems, such as those managed by GNU/Linux distributions, are composed of numerous packages developed independently by diverse communities. These ecosystems employ package management tools to facilitate software installation and dependency resolution. However, these tools lack robust mechanisms for systematically evaluating the development activity and versioning dynamics within their heterogeneous software environments. Objective: This research aims to introduce a systematic method and a prototype tool for assessing version activity within heterogeneous package manager ecosystems, enabling quantitative analysis of software package updates. Method: We developed a Package Version Activity Categorizer (PVAC) that consists of three components: a Version Categorizer (VC), which categorizes diverse semantic version numbers; a Version Number Delta (VND) component, which calculates a numeric score representing the aggregated semantic version changes across packages at the ecosystem level; and an Activity Categorizer (AC), which categorizes the activity of individual packages within that ecosystem. PVAC utilizes tailored regular expressions to parse semantic versioning details (epoch, major, minor, and patch versions) from diverse package version strings, enabling consistent categorization and quantitative scoring of version changes. Results: PVAC was empirically evaluated using a dataset of 22,535 packages drawn from recent releases of the Debian and Ubuntu GNU/Linux distributions. Our findings demonstrate PVAC's effectiveness in accurately categorizing versioning schemes and quantitatively measuring version activity across releases. We provide empirical evidence confirming that semantic versioning, including adapted variations, is predominantly employed across these ecosystems.
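
A hedged sketch of the regex-based parsing and delta-scoring idea; the pattern and weights below are illustrative assumptions, not PVAC's actual implementation:

```python
import re

# Hypothetical pattern in the spirit of PVAC's tailored regular
# expressions: optional Debian-style epoch, then major.minor[.patch].
SEMVER = re.compile(
    r"^(?:(?P<epoch>\d+):)?(?P<major>\d+)\.(?P<minor>\d+)(?:\.(?P<patch>\d+))?"
)

def version_delta(old: str, new: str):
    """Score the semantic distance between two version strings
    (the weights are illustrative, not PVAC's actual scoring)."""
    a, b = SEMVER.match(old), SEMVER.match(new)
    if not (a and b):
        return None  # unparseable scheme; categorized separately
    weights = {"epoch": 1000, "major": 100, "minor": 10, "patch": 1}
    return sum(
        w * abs(int(b.group(p) or 0) - int(a.group(p) or 0))
        for p, w in weights.items()
    )

print(version_delta("1:2.3.4", "1:2.4.0"))  # 14 under these toy weights
```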

[42] arXiv:2411.03079 (replaced) [pdf, html, other]
Title: Utilizing Precise and Complete Code Context to Guide LLM in Automatic False Positive Mitigation
Jinbao Chen (1), Hongjing Xiang (1), Zuohong Zhao (1), Luhao Li (1), Yu Zhang (1), Boyao Ding (1), Qingwei Li (1), Songyuan Xiong (1) ((1) University of Science and Technology of China)
Comments: 13 pages
Subjects: Software Engineering (cs.SE)

Static Application Security Testing (SAST) tools are critical to software quality, identifying potential code issues early in development. However, they often produce false positive warnings that require manual review, slowing down development. Thus, automating false positive mitigation (FPM) is essential. The advent of Large Language Models (LLMs), with their strong abilities in natural language and code understanding, offers promising avenues for FPM. Yet current LLM-based FPM methods face two major limitations: 1. The warning-related code snippets extracted are overly broad and cluttered with irrelevant control/data flows, reducing precision; 2. Critical code contexts are missing, leading to incomplete representations that can mislead LLMs and cause inaccurate assessments. To overcome these limitations, we propose LLM4FPM, a lightweight and efficient false positive mitigation framework. It features eCPG-Slicer, which builds an extended code property graph (eCPG) to extract precise line-level code contexts for warnings. Furthermore, the integrated FARF algorithm builds a file reference graph to identify all files that are relevant to warnings in linear time. This enables eCPG-Slicer to obtain rich contextual information without resorting to expensive whole-program analysis. LLM4FPM outperforms the existing method on the Juliet dataset (F1 > 99% across various Common Weakness Enumerations) and improves label accuracy on the D2A dataset to 86%. By leveraging a lightweight open-source LLM, LLM4FPM can save inspection costs of up to $2758 per run ($0.384 per warning) on Juliet, with an average inspection time of 4.7s per warning. Moreover, real-world tests on popular C/C++ projects demonstrate its practicality.
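
A sketch of the linear-time reachability idea behind a file reference graph; the graph representation and traversal below are illustrative, not the FARF algorithm itself:

```python
from collections import deque

def relevant_files(ref_edges, warning_file):
    """Breadth-first walk over a file reference graph (adjacency dict)
    to collect every file reachable from the warning's file, in
    O(V + E) time -- the kind of cheap, whole-repo context gathering
    the abstract describes, without whole-program analysis."""
    seen, queue = {warning_file}, deque([warning_file])
    while queue:
        f = queue.popleft()
        for g in ref_edges.get(f, ()):
            if g not in seen:
                seen.add(g)
                queue.append(g)
    return seen
```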

[43] arXiv:2502.12267 (replaced) [pdf, html, other]
Title: NeuroStrata: Harnessing Neurosymbolic Paradigms for Improved Design, Testability, and Verifiability of Autonomous CPS
Xi Zheng, Ziyang Li, Ivan Ruchkin, Ruzica Piskac, Miroslav Pajic
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Autonomous cyber-physical systems (CPSs) leverage AI for perception, planning, and control but face trust and safety certification challenges due to inherent uncertainties. The neurosymbolic paradigm replaces stochastic layers with interpretable symbolic AI, enabling determinism. While promising, challenges like multisensor fusion, adaptability, and verification remain. This paper introduces NeuroStrata, a neurosymbolic framework to enhance the testing and verification of autonomous CPS. We outline its key components, present early results, and detail future plans.

[44] arXiv:2503.02817 (replaced) [pdf, html, other]
Title: Open Source at a Crossroads: The Future of Licensing Driven by Monetization
Raula Gaikovina Kula, Brittany Anne Reid, Christoph Treude
Comments: Submitted to SE 2030 Software Engineering Roadmap Workshop
Subjects: Software Engineering (cs.SE)

The widespread adoption of open source libraries and frameworks can be attributed to their licensing. Open Source Software Licenses (OSS licenses) ensure that software can be sold or distributed as part of aggregate programs from various sources without requiring a royalty or fee. The quality of such code rivals that of commercial software, with open source libraries forming large parts of the supply chain for critical commercial systems in industry. Despite this, most open source projects rely on volunteer contributions, and unpaid library maintainers face significant pressure to sustain their projects. One potential solution for these projects is to change their licensing to ensure that maintainers are compensated accordingly for their work. In this paper, we explore the potential of licensing to help alleviate funding issues, with a review of three different cases where OSS licenses were modified to allow for monetization. In addition, we explore licensing concerns related to the emergence of the use of artificial intelligence (AI) in software development. We argue that open source is at a crossroads, with a growing need to redefine its licensing models and support communities and critical software. We identify specific research opportunities and conclude with a research agenda comprising a series of research questions to guide future studies in this area.

[45] arXiv:2503.07010 (replaced) [pdf, html, other]
Title: ProjectEval: A Benchmark for Programming Agents Automated Evaluation on Project-Level Code Generation
Kaiyuan Liu, Youcheng Pan, Yang Xiang, Daojing He, Jing Li, Yexing Du, Tianrun Gao
Comments: 17 pages (9 Appendix pages), 4 figures, 7 tables
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)

Recently, LLM agents have made rapid progress in improving their programming capabilities. However, existing benchmarks can neither automatically evaluate generated code from the user's perspective nor explain the results of LLM agents' code generation. Thus, we introduce ProjectEval, a new benchmark for the automated evaluation of LLM agents' project-level code generation through simulated user interaction. ProjectEval is constructed by an LLM with human review. It provides inputs at three levels of detail, ranging from natural language descriptions to code skeletons. ProjectEval evaluates the generated projects both by simulating user interaction for execution and by code similarity through existing objective indicators. Through ProjectEval, we find that systematic engineering of project code, an overall understanding of the project, and comprehensive analysis capability are the keys for LLM agents to complete practical projects. Our findings and benchmark provide valuable insights for developing more effective programming agents that can be deployed in future real-world production.

[46] arXiv:2503.16167 (replaced) [pdf, html, other]
Title: CodeReviewQA: The Code Review Comprehension Assessment for Large Language Models
Hong Yi Lin, Chunhua Liu, Haoyu Gao, Patanamon Thongtanunam, Christoph Treude
Comments: The paper is published in Findings of the Association for Computational Linguistics (ACL 2025)
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)

State-of-the-art large language models (LLMs) have demonstrated impressive code generation capabilities but struggle with real-world software engineering tasks, such as revising source code to address code reviews, hindering their practical use. Code review comments are often implicit, ambiguous, and colloquial, requiring models to grasp both code and human intent. This challenge calls for evaluating large language models' ability to bridge both technical and conversational contexts. While existing work has employed the automated code refinement (ACR) task to resolve these comments, current evaluation methods fall short, relying on text matching metrics that provide limited insight into model failures and remain susceptible to training data contamination. To address these limitations, we introduce a novel evaluation benchmark, $\textbf{CodeReviewQA}$, which enables us to conduct fine-grained assessment of model capabilities and mitigate data contamination risks. In CodeReviewQA, we decompose the generation task of code refinement into $\textbf{three essential reasoning steps}$: $\textit{change type recognition}$ (CTR), $\textit{change localisation}$ (CL), and $\textit{solution identification}$ (SI). Each step is reformulated as multiple-choice questions with varied difficulty levels, enabling precise assessment of model capabilities while mitigating data contamination risks. Our comprehensive evaluation spans 72 recently released large language models on $\textbf{900 manually curated, high-quality examples}$ across nine programming languages. Our results show that CodeReviewQA is able to expose specific model weaknesses in code review comprehension, disentangled from their generative automated code refinement results.

[47] arXiv:2503.19449 (replaced) [pdf, html, other]
Title: VecTrans: Enhancing Compiler Auto-Vectorization through LLM-Assisted Code Transformations
Zhongchun Zheng, Kan Wu, Long Cheng, Lu Li, Rodrigo C. O. Rocha, Tianyi Liu, Wei Wei, Jianjiang Zeng, Xianwei Zhang, Yaoqing Gao
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)

Auto-vectorization is a fundamental optimization for modern compilers to exploit SIMD parallelism. However, state-of-the-art approaches still struggle to handle intricate code patterns, often requiring manual hints or domain-specific expertise. Large language models (LLMs), with their ability to capture intricate patterns, provide a promising solution, yet their effective application in compiler optimizations remains an open challenge due to issues such as hallucinations and a lack of domain-specific reasoning. In this paper, we present VecTrans, a novel framework that leverages LLMs to enhance compiler-based code vectorization. VecTrans first employs compiler analysis to identify potentially vectorizable code regions. It then utilizes an LLM to refactor these regions into patterns that are more amenable to the compiler's auto-vectorization. To ensure semantic correctness, VecTrans further integrates a hybrid validation mechanism at the intermediate representation (IR) level. With the above efforts, VecTrans combines the adaptability of LLMs with the precision of compiler vectorization, thereby effectively opening up vectorization opportunities. Experimental results show that among all TSVC functions unvectorizable by GCC, ICC, Clang, and the BiSheng Compiler, VecTrans achieves a geomean speedup of 1.77x and successfully vectorizes 24 of 51 test cases. This marks a significant advancement over state-of-the-art approaches while maintaining a cost efficiency of $0.012 per function optimization for LLM API usage.

[48] arXiv:2504.11711 (replaced) [pdf, html, other]
Title: The Hitchhiker's Guide to Program Analysis, Part II: Deep Thoughts by LLMs
Haonan Li, Hang Zhang, Kexin Pei, Zhiyun Qian
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Static analysis plays a crucial role in software vulnerability detection, yet faces a persistent precision-scalability tradeoff. In large codebases like the Linux kernel, traditional static analysis tools often generate excessive false positives due to simplified vulnerability modeling and overapproximation of path and data constraints. While large language models (LLMs) demonstrate promising code understanding capabilities, their direct application to program analysis remains unreliable due to inherent reasoning limitations.
We introduce BugLens, a post-refinement framework that significantly enhances static analysis precision for bug detection. BugLens guides LLMs through structured reasoning steps to assess security impact and validate constraints from the source code. When evaluated on Linux kernel taint-style bugs detected by static analysis tools, BugLens improves precision approximately 7-fold (from 0.10 to 0.72), substantially reducing false positives while uncovering four previously unreported vulnerabilities. Our results demonstrate that a well-structured, fully automated LLM-based workflow can effectively complement and enhance traditional static analysis techniques.

[49] arXiv:2505.01136 (replaced) [pdf, html, other]
Title: Descriptor: C++ Self-Admitted Technical Debt Dataset (CppSATD)
Phuoc Pham, Murali Sridharan, Matteo Esposito, Valentina Lenarduzzi
Subjects: Software Engineering (cs.SE); Information Retrieval (cs.IR); Machine Learning (cs.LG); Programming Languages (cs.PL)

In software development, technical debt (TD) refers to suboptimal implementation choices made by the developers to meet urgent deadlines and limited resources, posing challenges for future maintenance. Self-Admitted Technical Debt (SATD) is a sub-type of TD, representing specific TD instances "openly admitted" by the developers and often expressed through source code comments. Previous research on SATD has focused predominantly on the Java programming language, revealing a significant gap in cross-language SATD. Such a narrow focus limits the generalizability of existing findings as well as SATD detection techniques across multiple programming languages. Our work addresses such limitation by introducing CppSATD, a dedicated C++ SATD dataset, comprising over 531,000 annotated comments and their source code contexts. Our dataset can serve as a foundation for future studies that aim to develop SATD detection methods in C++, generalize the existing findings to other languages, or contribute novel insights to cross-language SATD research.
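
For illustration, a keyword-based heuristic of the kind often used to pre-filter SATD candidates in source comments; the marker list and simplified comment grammar are assumptions, not the dataset's annotation procedure:

```python
import re

# Illustrative SATD pre-filter: comments containing common debt markers.
SATD_PATTERN = re.compile(r"\b(TODO|FIXME|HACK|XXX|workaround|kludge)\b", re.I)
COMMENT = re.compile(r"//[^\n]*|/\*.*?\*/", re.S)  # C++ line/block comments

def satd_candidates(cpp_source: str):
    """Return comments in a C++ file that match common SATD markers."""
    return [c for c in COMMENT.findall(cpp_source) if SATD_PATTERN.search(c)]
```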

[50] arXiv:2505.12185 (replaced) [pdf, html, other]
Title: EVALOOP: Assessing LLM Robustness in Programming from a Self-consistency Perspective
Sen Fang, Weiyuan Ding, Bowen Xu
Comments: 19 pages, 11 figures
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)

Assessing the programming capabilities of Large Language Models (LLMs) is crucial for their effective use in software engineering. Current evaluations, however, predominantly measure the accuracy of generated code on static benchmarks, neglecting the critical aspect of model robustness during programming tasks. While adversarial attacks offer insights into model robustness, their effectiveness is limited and the resulting evaluation can be constrained: current adversarial attack methods for robustness evaluation yield inconsistent results, struggling to provide a unified evaluation across different LLMs. We introduce EVALOOP, a novel assessment framework that evaluates robustness from a self-consistency perspective, i.e., by leveraging the natural duality inherent in popular software engineering tasks, e.g., code generation and code summarization. EVALOOP initiates a self-contained feedback loop: an LLM generates output (e.g., code) from an input (e.g., a natural language specification), and then uses the generated output as the input to produce a new output (e.g., summarizing that code into a new specification). EVALOOP repeats this process over successive loops to assess how well each LLM sustains its performance. This cyclical strategy intrinsically evaluates robustness without relying on any external attack setups, providing a unified metric for LLMs' robustness in programming. We evaluate 16 prominent LLMs (e.g., GPT-4.1, O4-mini) on EVALOOP and find that EVALOOP typically induces a 5.01%-19.31% absolute drop in pass@1 performance within ten loops. Intriguingly, robustness does not always align with initial performance (i.e., one-time query); for instance, GPT-3.5-Turbo, despite superior initial code generation compared to DeepSeek-V2, demonstrated lower robustness over repeated evaluation loops.
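
A hedged sketch of the feedback loop, assuming hypothetical generate_code, summarize_code, and pass@1 interfaces:

```python
def evaloop(llm, spec, tests, n_loops=10):
    """Self-consistency loop: spec -> code -> new spec -> code -> ...
    Tracks pass@1 on a fixed test suite at each loop; a robust model
    shows a flat curve, a fragile one decays over the loops."""
    pass_rates = []
    for _ in range(n_loops):
        code = llm.generate_code(spec)       # generation direction
        pass_rates.append(tests.pass_at_1(code))
        spec = llm.summarize_code(code)      # dual (summarization) direction
    return pass_rates
```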

[51] arXiv:2505.21323 (replaced) [pdf, html, other]
Title: A first look at ROS 2 applications written in asynchronous Rust
Martin Škoudlil, Michal Sojka, Zdeněk Hanzálek
Subjects: Software Engineering (cs.SE)

The increasing popularity of the Rust programming language in building robotic applications using the Robot Operating System (ROS 2) raises questions about its real-time execution capabilities, particularly when employing asynchronous programming. Existing real-time scheduling and response-time analysis techniques for ROS 2 focus on applications written in C++ and do not address the unique execution models and challenges presented by Rust's asynchronous programming paradigm. In this paper, we analyze the execution model of R2R -- an asynchronous Rust binding for ROS 2 -- and of various asynchronous Rust runtimes, comparing them with the execution model of C++ ROS 2 applications. We propose a structured approach for R2R applications aimed at deterministic real-time operation, involving thread prioritization and callback-to-thread mapping schemes. Our experimental evaluation, based on measuring end-to-end latencies of a synthetic application, shows that the proposed approach is effective and outperforms other evaluated configurations. A more complex autonomous driving case study demonstrates its practical applicability. Overall, the experimental results indicate that our proposed structure achieves bounded response times for time-critical tasks. This paves the way for future work to adapt existing or develop new response-time analysis techniques for R2R applications using our structure.

[52] arXiv:2505.23387 (replaced) [pdf, html, other]
Title: Afterburner: Reinforcement Learning Facilitates Self-Improving Code Efficiency Optimization
Mingzhe Du, Luu Tuan Tuan, Yue Liu, Yuhao Qing, Dong Huang, Xinyi He, Qian Liu, Zejun Ma, See-kiong Ng
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Large Language Models (LLMs) generate functionally correct solutions but often fall short in code efficiency, a critical bottleneck for real-world deployment. In this paper, we introduce a novel test-time iterative optimization framework to address this, employing a closed-loop system where LLMs iteratively refine code based on empirical performance feedback from an execution sandbox. We explore three training strategies: Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). Experiments on our Venus dataset and the APPS benchmark show that SFT and DPO rapidly saturate in efficiency gains. In contrast, GRPO, using reinforcement learning (RL) with execution feedback, continuously optimizes code performance, significantly boosting both pass@1 (from 47% to 62%) and the likelihood of outperforming human submissions in efficiency (from 31% to 45%). Our work demonstrates effective test-time code efficiency improvement and critically reveals the power of RL in teaching LLMs to truly self-improve code efficiency.
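
A minimal sketch of a test-time closed-loop refinement of this kind, assuming hypothetical llm.refine and run_in_sandbox interfaces (the latter returning a runtime in seconds, or None on test failure):

```python
def optimize_iteratively(llm, code, run_in_sandbox, budget=5):
    """Closed-loop refinement: execute, measure, feed the runtime back
    to the model, and keep the fastest functionally correct candidate.
    Assumes the initial code already passes its tests."""
    best_code, best_time = code, run_in_sandbox(code)
    for _ in range(budget):
        candidate = llm.refine(best_code, feedback=f"runtime: {best_time:.3f}s")
        t = run_in_sandbox(candidate)
        if t is not None and t < best_time:  # correct and faster
            best_code, best_time = candidate, t
    return best_code
```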

[53] arXiv:2505.23419 (replaced) [pdf, html, other]
Title: SWE-bench Goes Live!
Linghao Zhang, Shilin He, Chaoyun Zhang, Yu Kang, Bowen Li, Chengxing Xie, Junhao Wang, Maoquan Wang, Yufan Huang, Shengyu Fu, Elsie Nallipogu, Qingwei Lin, Yingnong Dang, Saravan Rajmohan, Dongmei Zhang
Comments: Homepage: this https URL, Code: this https URL, Dataset: this https URL
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

The issue-resolving task, where a model generates patches to fix real-world bugs, has emerged as a critical benchmark for evaluating the capabilities of large language models (LLMs). While SWE-bench and its variants have become standard in this domain, they suffer from key limitations: they have not been updated since their initial releases, cover a narrow set of repositories, and depend heavily on manual effort for instance construction and environment setup. These factors hinder scalability and introduce risks of overfitting and data contamination. In this work, we present SWE-bench-Live, a live-updatable benchmark designed to overcome these challenges. Our initial release consists of 1,319 tasks derived from real GitHub issues created since 2024, spanning 93 repositories. Each task is accompanied by a dedicated Docker image to ensure reproducible execution. Central to our benchmark is an automated curation pipeline that streamlines the entire process from instance creation to environment setup, removing manual bottlenecks and enabling scalability and continuous updates. We evaluate a range of state-of-the-art agent frameworks and LLMs on SWE-bench-Live, revealing a substantial performance gap compared to static benchmarks like SWE-bench, even under controlled evaluation conditions. To better understand this discrepancy, we perform detailed analyses across repository origin, issue recency, and task difficulty. By providing a fresh, diverse, and executable benchmark grounded in live repository activity, SWE-bench-Live facilitates rigorous, contamination-resistant evaluation of LLMs and agents in dynamic, real-world software development settings.

[54] arXiv:2505.23516 (replaced) [pdf, html, other]
Title: The CASE Framework -- A New Architecture for Participatory Research and Digital Health Surveillance
Marco Hirsch, Peter Hevesi, Paul Lukowicz
Comments: 10 pages, 5 figures. Submitted as a preprint to arXiv (no prior publication)
Subjects: Software Engineering (cs.SE); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)

We present the CASE framework, an open-source platform for adaptive, context-aware participatory research and pandemic preparedness. CASE implements an event-driven architecture that enables dynamic survey workflows, allowing real-time adaptation based on participant responses, external data, temporal conditions, and evolving user states. The framework supports a broad range of research needs, from simple one-time questionnaires to complex longitudinal studies with advanced conditional logic. Built on over a decade of practical experience, CASE underwent a major architectural rework in 2024, transitioning from a microservice-based design to a streamlined monolithic architecture. This evolution significantly improved maintainability, flexibility, and ease of deployment, particularly for institutions with limited technical capacity. CASE has been successfully deployed across diverse domains, powering national disease surveillance platforms, supporting post-COVID cohort studies, and enabling real-time sentiment analysis during political events. These applications, involving tens of thousands of participants, demonstrate the framework's scalability, versatility, and practical value. This paper describes the foundations of CASE, details its architectural evolution, and presents lessons learned from real-world deployments. We establish CASE as a mature and reusable research infrastructure that balances sophisticated functionality with practical implementation, addressing the critical global need for sustainable and institutionally controlled data collection systems.

[55] arXiv:2505.23671 (replaced) [pdf, other]
Title: GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents
Manish Shetty, Naman Jain, Jinjian Liu, Vijay Kethanaboyina, Koushik Sen, Ion Stoica
Comments: Website: this https URL
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Developing high-performance software is a complex task that requires specialized expertise. We introduce GSO, a benchmark for evaluating language models' capabilities in developing high-performance software. We develop an automated pipeline that analyzes repository commit histories, then generates and executes performance tests, to identify 102 challenging optimization tasks across 10 codebases spanning diverse domains and programming languages. An agent is provided with a codebase and a performance test as a precise specification, and is tasked with improving the runtime efficiency, which is measured against the expert developer's optimization. Our quantitative evaluation reveals that leading SWE-Agents struggle significantly, achieving less than a 5% success rate, with limited improvements even with inference-time scaling. Our qualitative analysis identifies key failure modes, including difficulties with low-level languages, lazy optimization strategies, and challenges in accurately localizing bottlenecks. We release the code and artifacts of our benchmark along with agent trajectories to enable future research.

[56] arXiv:2408.16028 (replaced) [pdf, html, other]
Title: ANVIL: Anomaly-based Vulnerability Identification without Labelled Training Data
Weizhou Wang, Eric Liu, Xiangyu Guo, Xiao Hu, Ilya Grishchenko, David Lie
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)

Supervised-learning-based vulnerability detectors often fall short due to limited labelled training data. In contrast, Large Language Models (LLMs) like GPT-4 are trained on vast unlabelled code corpora, yet perform only marginally better than coin flips when directly prompted to detect vulnerabilities. In this paper, we reframe vulnerability detection as anomaly detection, based on the premise that vulnerable code is rare and thus anomalous relative to patterns learned by LLMs. We introduce ANVIL, which performs a masked code reconstruction task: the LLM reconstructs a masked line of code, and deviations from the original are scored as anomalies. We propose a hybrid anomaly score that combines exact match, cross-entropy loss, prediction confidence, and structural complexity. We evaluate our approach across multiple LLM families, scoring methods, and context sizes, and against vulnerabilities after the LLM's training cut-off. On the PrimeVul dataset, ANVIL outperforms state-of-the-art supervised detectors (LineVul, LineVD, and LLMAO), achieving up to 2x higher Top-3 accuracy, 75% better Normalized MFR, and a significant improvement in ROC-AUC. Finally, by integrating ANVIL with fuzzers, we uncover two previously unknown vulnerabilities, demonstrating the practical utility of anomaly-guided detection.
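
The abstract names four signals; here is a hedged sketch of how they might be combined into a single anomaly score. The weighted-sum form and weights are illustrative; ANVIL's exact formula may differ.

```python
def anomaly_score(original_line, reconstructed_line, cross_entropy,
                  confidence, complexity, w=(1.0, 1.0, 1.0, 1.0)):
    """Combine the four signals from the abstract into one score:
    higher means more anomalous, hence more likely vulnerable."""
    exact_mismatch = 0.0 if original_line.strip() == reconstructed_line.strip() else 1.0
    return (w[0] * exact_mismatch            # LLM failed to reproduce the line
            + w[1] * cross_entropy           # loss of the original line under the LLM
            + w[2] * (1.0 - confidence)      # low prediction confidence -> anomalous
            + w[3] * complexity)             # structurally complex lines weigh more
```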

[57] arXiv:2503.05315 (replaced) [pdf, html, other]
Title: LoRACode: LoRA Adapters for Code Embeddings
Saumya Chaturvedi, Aman Chadha, Laurent Bindschaedler
Comments: Accepted at the Deep Learning for Code (DL4C) Workshop at ICLR 2025
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR); Software Engineering (cs.SE)

Code embeddings are essential for semantic code search; however, current approaches often struggle to capture the precise syntactic and contextual nuances inherent in code. Open-source models such as CodeBERT and UniXcoder exhibit limitations in scalability and efficiency, while high-performing proprietary systems impose substantial computational costs. We introduce a parameter-efficient fine-tuning method based on Low-Rank Adaptation (LoRA) to construct task-specific adapters for code retrieval. Our approach reduces the number of trainable parameters to less than two percent of the base model, enabling rapid fine-tuning on extensive code corpora (2 million samples in 25 minutes on two H100 GPUs). Experiments demonstrate an increase of up to 9.1% in Mean Reciprocal Rank (MRR) for Code2Code search, and up to 86.69% for Text2Code search tasks across multiple programming languages. Separating task-wise and language-wise adaptation helps explore the sensitivity of code retrieval to syntactic and linguistic variations. To foster research in this area, we make our code and pre-trained models publicly available.
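
For readers unfamiliar with LoRA, a minimal sketch using the Hugging Face PEFT library; the base model, rank, and target modules here are illustrative choices, not necessarily those of LoRACode:

```python
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

# Attach low-rank adapters to the attention projections of a code encoder.
base = AutoModel.from_pretrained("microsoft/unixcoder-base")
config = LoraConfig(
    r=16,                               # low-rank dimension
    lora_alpha=32,                      # scaling factor
    target_modules=["query", "value"],  # attention projection layers
    lora_dropout=0.05,
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically a small fraction of the base model
```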

[58] arXiv:2503.22935 (replaced) [pdf, html, other]
Title: Improving the Context Length and Efficiency of Code Retrieval for Tracing Security Vulnerability Fixes
Xueqing Liu, Jiangrui Zheng, Guanqun Yang, Siyan Wen, Qiushi Liu, Xiaoyin Wang
Subjects: Cryptography and Security (cs.CR); Software Engineering (cs.SE)

An upstream task for software bills of materials (SBOMs) is the accurate localization of the patch that fixes a vulnerability. Nevertheless, existing work reveals a significant gap: for many CVEs, patches exist but are not traceable. Existing works have proposed several approaches to trace/retrieve the patching commit that fixes a CVE. However, they suffer from two major challenges: (1) they cannot effectively handle the long diff code of a commit; (2) we are not aware of existing work that scales to the full repository with satisfactory accuracy. Upon identifying this gap, we propose SITPatchTracer, a scalable and effective retrieval system for tracing known vulnerability patches. To handle the context length challenge, SITPatchTracer proposes a novel hierarchical embedding technique which efficiently extends the context coverage to 6x that of existing work while covering all files in the commit. To handle the scalability challenge, SITPatchTracer utilizes a three-phase framework, balancing effectiveness and efficiency in each phase.
The evaluation of SITPatchTracer demonstrates that it outperforms existing patch tracing methods (PatchFinder, PatchScout, VFCFinder) by a large margin. Furthermore, SITPatchTracer outperforms VoyageAI, the SOTA commercial code embedding LLM ($1.8 per 10K commits), on MRR and Recall@10 by 18% and 28% on our two datasets. Using SITPatchTracer, we have successfully traced and merged the patch links for 35 new CVEs in the GitHub Advisory database. Our ablation study reveals that hierarchical embedding is a practically effective way of handling long context for patch retrieval.
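
A minimal sketch of the hierarchical-embedding idea, assuming a generic embed function and character-based chunking as a crude stand-in for tokenization; the mean-pooling scheme is illustrative, not necessarily SITPatchTracer's:

```python
import numpy as np

def embed_commit(file_diffs, embed, chunk_len=2048):
    """Chunk each file's diff to fit the encoder's window, embed the
    chunks, pool them into per-file vectors, then pool files into a
    single commit vector -- covering all files instead of truncating."""
    file_vecs = []
    for diff in file_diffs:
        chunks = [diff[i:i + chunk_len] for i in range(0, len(diff), chunk_len)]
        file_vecs.append(np.mean([embed(c) for c in chunks], axis=0))
    commit_vec = np.mean(file_vecs, axis=0)
    return commit_vec / np.linalg.norm(commit_vec)  # unit vector for retrieval
```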

[59] arXiv:2505.01892 (replaced) [pdf, html, other]
Title: OODTE: A Differential Testing Engine for the ONNX Optimizer
Nikolaos Louloudakis, Ajitha Rajan
Comments: 12 pages, 3 figures, 3 tables
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE); Systems and Control (eess.SY)

With over 700 stars on GitHub and being part of the official ONNX repository, the ONNX Optimizer is the default tool for applying graph-based optimizations to ONNX models. Despite its widespread use, its ability to maintain model accuracy during optimization has not been thoroughly investigated. In this work, we present OODTE, a utility designed to automatically and comprehensively evaluate the correctness of the ONNX Optimizer. OODTE adopts a straightforward yet powerful differential testing and evaluation methodology, which can be readily adapted for use with other compiler optimizers. Specifically, OODTE takes a collection of ONNX models, applies optimizations, and executes both the original and optimized versions across a user-defined input set, automatically capturing any issues encountered during optimization. When discrepancies in accuracy arise, OODTE iteratively isolates the responsible optimization pass by repeating the process at a finer granularity. We applied OODTE to 130 well-known models from the official ONNX Model Hub, spanning diverse tasks including classification, object detection, semantic segmentation, text summarization, question answering, and sentiment analysis. Our evaluation revealed that 9.2% of the model instances either caused the optimizer to crash or led to the generation of invalid models using default optimization strategies. Additionally, 30% of classification models and 16.6% of object detection and segmentation models exhibited differing outputs across original and optimized versions, whereas models focused on text-related tasks were generally robust to optimization. OODTE uncovered 15 issues (14 previously unknown) affecting 9 of 47 optimization passes and the optimizer overall. All issues were reported to the ONNX Optimizer team. OODTE offers a simple but effective framework for validating AI model optimizers, applicable beyond the ONNX ecosystem.
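
A condensed sketch of the differential-testing idea, using the real onnxoptimizer and onnxruntime packages; the model path, random input synthesis, and tolerance are illustrative:

```python
import numpy as np
import onnx
import onnxoptimizer
import onnxruntime as ort

# Optimize a model with the default passes, then compare outputs.
model = onnx.load("model.onnx")           # illustrative path
optimized = onnxoptimizer.optimize(model)

sess_a = ort.InferenceSession(model.SerializeToString())
sess_b = ort.InferenceSession(optimized.SerializeToString())

inp = sess_a.get_inputs()[0]
# Substitute 1 for any dynamic (non-integer) dimensions.
shape = [d if isinstance(d, int) else 1 for d in inp.shape]
x = np.random.rand(*shape).astype(np.float32)

out_a = sess_a.run(None, {inp.name: x})[0]
out_b = sess_b.run(None, {inp.name: x})[0]
if not np.allclose(out_a, out_b, atol=1e-5):
    # A finer-grained run would re-optimize pass by pass to isolate the culprit.
    print("Discrepancy found between original and optimized model.")
```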

Total of 59 entries
Showing up to 2000 entries per page: fewer | more | all
  • About
  • Help
  • Contact arXiv
  • Subscribe to arXiv mailings
  • Copyright
  • Privacy Policy
  • Web Accessibility Assistance
  • arXiv Operational Status
    Get status notifications via email or slack